Code Generation and Global Optimization Techniques for a Reconfigurable PRAM-NUMA Multicore Architecture
نویسنده
چکیده
In this thesis we describe techniques for code generation and global optimization for a PRAM-NUMA multicore architecture. We specifically focus on the REPLICA architecture which is a family massively multithreaded very long instruction word (VLIW) chip multiprocessors with chained functional units that has a reconfigurable emulated shared on-chip memory. The on-ship memory system supports two execution modes, PRAM and NUMA, which can be switched between at run-time. PRAM mode is considered the standard execution mode and targets mainly applications with very high thread level parallelism (TLP). In contrast, NUMA mode is optimized for sequential legacy applications and applications with low amount of TLP. Different versions of the REPLICA architecture have different number of cores, hardware threads and functional units. In order to utilize the REPLICA architecture efficiently we have made several contributions to the development of a compiler for REPLICA target code generation. It supports both code generation for PRAM mode and NUMA mode and can generate code for different versions of the processor pipeline (i.e. for different numbers of functional units). It includes optimization phases to increase the utilization of the available functional units. We have also contributed to quantitative the evaluation of PRAM and NUMA mode. The results show that PRAM mode often suits programs with irregular memory access patterns and control flow best while NUMA mode suites regular programs better. However, for a particular program it is not always obvious which mode, PRAM or NUMA, will show best performance. To tackle this we contributed a case study for generic stencil computations, using machine learning derived cost models in order to automatically select at runtime which mode to execute in. We extended this to also include a sequence of kernels. This work has been supported by VTT Oulu, Finland and SeRC. Abstract In this thesis we describe techniques for code generation and global optimization for a PRAM-NUMA multicore architecture. We specifically focus on the REPLICA architecture which is a family massively multithreaded very long instruction word (VLIW) chip multiprocessors with chained functional units that has a reconfigurable emulated shared on-chip memory. The on-ship memory system supports two execution modes, PRAM and NUMA, which can be switched between at run-time. PRAM mode is considered the standard execution mode and targets mainly applications with very high thread level parallelism (TLP). In contrast, NUMA mode is optimized for sequential legacy applications and applications with low amount of TLP. Different versions of the REPLICA architecture have different number of cores, hardware threads and functional units. In order to utilize the REPLICA architecture efficiently we have made several contributions to the development of a compiler for REPLICA target code generation. It supports both code generation for PRAM mode and NUMA mode and can generate code for different versions of the processor pipeline (i.e. for different numbers of functional units). It includes optimization phases to increase the utilization of the available functional units. We have also contributed to quantitative the evaluation of PRAM and NUMA mode. The results show that PRAM mode often suits programs with irregular memory access patterns and control flow best while NUMA mode suites regular programs better. However, for a particular program it is not always obvious which mode, PRAM or NUMA, will show best performance. To tackle this we contributed a case study for generic stencil computations, using machine learning derived cost models in order to automatically select at runtime which mode to execute in. We extended this to also include a sequence of kernels. This work has been supported by VTT Oulu, Finland and SeRC.
منابع مشابه
Configuration Code Generation and Optimizations for Heterogeneous Reconfigurable Dsps
In this paper we describe a code generation and optimization process for reconfigurable architectures targeting digital signal processing and wireless communication applications. The ability to generate efficient and compact code is essential for the success of reconfigurable architectures. Otherwise, the overhead of reconfiguring could easily become the system bottleneck. Our code generation p...
متن کاملLocality Optimization on a NUMA Architecture for Hybrid LU Factorization
We study the impact of non-uniform memory accesses (NUMA) on the solution of dense general linear systems using an LU factorization algorithm. In particular we illustrate how an appropriate placement of the threads and memory on a NUMA architecture can improve the performance of the panel factorization and consequently accelerate the global LU factorization. We apply these placement strategies ...
متن کاملDesign of a novel congestion-aware communication mechanism for wireless NoC architecture in multicore systems
Hybrid Wireless Network-on-Chip (WNoC) architecture is emerged as a scalable communication structure to mitigate the deficits of traditional NOC architecture for the future Multi-core systems. The hybrid WNoC architecture provides energy efficient, high data rate and flexible communications for NoC architectures. In these architectures, each wireless router is shared by a set of processing core...
متن کاملThe case for holistic query evaluation
In this thesis we present the holistic query evaluation model. We propose a novel query engine design that exploits the characteristics of modern processors when queries execute inside main memory. The holistic model (a) is based on template-based code generation for each executed query, (b) uses multithreading to adapt to multicore processor architectures and (c) addresses the optimization pro...
متن کاملAn Analysis of Shared Library Performance on NUMA Architectures
Most modern multicore systems these days are Non Uniform Memory Architecture (NUMA), which means they have multiple memory controllers with non-uniform access latencies across them. There has been significant amount of work done in exploring and mitigating performance penalty due to NUMA overhead. In previous works, NUMA-aware schedulers were proposed, sometimes with an objective of keeping dat...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2003